Abstract

We consider a task assignment problem in crowdsourcing, which is aimed at 
collecting as many reliable labels as possible within a limited budget. A challenge 
in this scenario is how to cope with the diversity of tasks and the task-dependent 
reliability of workers, e.g., a worker may be good at recognizing the name of sports 
teams, but not be familiar with cosmetics brands. We refer to this practical setting 
as heterogeneous crowdsourcing. In this paper, we propose a contextual bandit 
formulation for task assignment in heterogeneous crowdsourcing, which is able to 
deal with the exploration-exploitation trade-off in worker selection. We also
theoretically investigate the regret bounds for the proposed method, and demonstrate
its practical usefulness experimentally. 


1 Introduction 


The quantity and quality of labeled data significantly affect the performance of machine
learning algorithms. However, collecting reliable labels from experts is usually expensive
and time-consuming. The recent emergence of crowdsourcing services such as Amazon
Mechanical Turk¹ (MTurk) enables us to cheaply collect huge quantities of labeled data
from crowds of workers for many machine learning tasks, e.g., natural language processing
(Snow et al., 2008) and computer vision (Welinder and Perona, 2010). In a crowdsourcing
system, a requester asks workers to complete a set of labeling tasks by paying them a
tiny amount of money for each label.

¹ https://www.mturk.com/mturk/

The primary interest of crowdsourcing research has been how to cope with the different
reliability of workers and aggregate the collected noisy labels (Dawid and Skene, 1979;
Smyth et al., 1994; Ipeirotis et al., 2010; Raykar et al., 2010; Welinder et al., 2010;
Yan et al., 2010; Kajino et al., 2012; Liu et al., 2012; Zhou et al., 2012). Usually, a weighted
voting mechanism is implicitly or explicitly utilized for label aggregation, with workers’
reliability as weights. Many existing methods use Expectation-Maximization (EM)
(Dempster et al., 1977) on static datasets of the collected labels to jointly estimate workers’
reliability and true labels. However, how to adaptively collect these labels is often
neglected. Since the total budget for a requester to pay the workers is usually limited, it
is necessary to consider how to intelligently use the budget to assign tasks to workers.

This leads to another important line of crowdsourcing research, which is called the task
routing or task assignment problem. There are two classes of task assignment methods:
push and pull. While in pull methods the system takes a passive role and only
sets up the environment for workers to find tasks themselves, in push methods the system
takes complete control over which tasks are assigned to whom (Law and von Ahn,
2011). In this paper we focus on push methods, and refer to them as task assignment
methods henceforth. Most of the existing task assignment methods run in an online
mode, simultaneously learning workers’ reliability and collecting labels (Donmez et al.,
2009; Chen et al., 2013; Ertekin et al., 2014). To deal with the exploration (i.e. learning
which workers are reliable) and exploitation (i.e. selecting the workers considered to be
reliable) trade-off in worker selection, IEThresh (Donmez et al., 2009) and CrowdSense
(Ertekin et al., 2014) dynamically sample worker subsets according to workers’ labeling
performances. However, this is not enough in recent heterogeneous crowdsourcing, where
a worker may be reliable at only a subset of tasks of a certain type. For example, the
workers considered to be good at previous tasks in IEThresh and CrowdSense may be
bad at the next ones. Therefore, it is more reasonable to model task-dependent reliability for
workers in heterogeneous crowdsourcing. Another issue in IEThresh and CrowdSense is
that the budget is not pre-fixed. That is, the requester will not know the total budget
until the whole task assignment process ends. This makes those two methods not so practical
for crowdsourcing. OptKG (Chen et al., 2013) runs within a pre-fixed budget and
formulates the task assignment problem as a Markov decision process (MDP). However,
it is difficult to give theoretical guarantees for OptKG when heterogeneous workers are
involved.

In recent crowdsourcing markets, as the heterogeneity of tasks is increasing, many
researchers have started to focus on heterogeneous crowdsourcing. Goel et al. (2014) studied
the problem of incentive-compatible mechanism design for heterogeneous markets. The
goal is to properly price the tasks for worker truthfulness and maximize the requester
utility under the financial constraint. Ho and Vaughan (2012) and Ho et al. (2013) studied
the problem of task assignment in heterogeneous crowdsourcing. However, theirs is another
variant of the problem setting, where workers arrive online and the requester must assign
a task (or sequence of tasks) to each new worker as she arrives (Slivkins and Vaughan,
2014). In our problem setting, by contrast, the requester completely controls which task to pick
and which worker to select at each step.

From a technical perspective, the most similar problem setting to ours is that of 
OptKG, where we can determine a task-worker pair at each step. For the purpose of 
extensive comparison, we also include two heuristic methods IEThresh and CrowdSense 
as well as OptKG in our experiments. These three task assignment methods are further 
detailed in Section 4.

In this paper, we propose a contextual bandit formulation for task assignment in heterogeneous
crowdsourcing. Our method models task-dependent reliability for workers by
using a weight that depends on the context of a certain task. Here, context can be interpreted
as the type or required skill of a task. For label aggregation, we adopt weighted
voting, which is a common solution used for aggregating noisy labels in crowdsourcing.
Our method consists of two phases: the pure exploration phase and the adaptive assignment
phase. In the pure exploration phase, we explore workers’ reliability in a batch mode
and initialize their weights as the input of the adaptive assignment phase. On the other
hand, the adaptive assignment phase includes a bandit-based strategy, where we sequentially
select a worker for a given labeling task with the help of the exponential weighting
scheme (Cesa-Bianchi and Lugosi, 2006; Arora et al., 2012), which is a standard tool for
bandit problems. The whole method runs within a limited budget. Moreover, we also
investigate the regret bounds of our strategy theoretically, and demonstrate its practical
usefulness experimentally.

The rest of this paper is organized as follows. In Section 2, we describe our proposed
bandit-based task assignment method for heterogeneous crowdsourcing. Then we theoretically
investigate its regret bounds in Section 3. For the purpose of comparison, we
look into the details of the existing task assignment methods for crowdsourcing in Section 4
and experimentally evaluate them together with the proposed method in Section 5.
Finally, we present our conclusions in Section 6.


2 Bandit-Based Task Assignment 

In this section, we describe our proposed bandit-based task assignment (BBTA) method. 


2.1 Problem Formulation 


Suppose we have $N$ unlabeled tasks with indices $\mathcal{I} = \{1, \ldots, N\}$, each of which is characterized
by a context $s$ from a given context set $\mathcal{S}$, where $|\mathcal{S}| = S$. Let $\{y_i^*\}_{i=1}^{N}$ be the
unknown true labels of tasks, where $y_i^* \in \{-1, 1\}$. Each time given a task, we ask one
worker from a pool of $K$ workers for a (possibly noisy) label, consuming one unit of the
total budget $T$. Our goal is to find a suitable task-worker assignment to collect as many
reliable labels as possible within the limited budget $T$. Finally, we aggregate the collected
labels to estimate the true labels $\{y_i^*\}_{i=1}^{N}$.

Let $y_{i,j}$ be the individual label of task $i$ (with context $s_i$) given by worker $j$. If the
label is missing, we set $y_{i,j} = 0$. For simplicity, we omit the subscript $i$ of context $s_i$,
and refer to the context of the current task as $s$. We denote the weight of worker $j$ for
context $s$ by $w_j^s$ ($> 0$), corresponding to the task-dependent reliability of worker $j$. Note
that $w_j^s$ is what we learn dynamically in the method. Then an estimate of the true label
$y_i^*$ is calculated by the weighted voting mechanism as

$$\hat{y}_i = \mathrm{sign}(\bar{y}_i), \quad \text{where} \quad \bar{y}_i = \frac{\sum_{j=1}^{K} w_j^s y_{i,j}}{\sum_{j'=1}^{K} w_{j'}^s}. \tag{1}$$
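The weighted vote of Eqn. (1) can be illustrated with a minimal Python sketch. The function name `weighted_vote` and the list-based inputs are hypothetical choices for illustration, and returning 0 for a perfectly tied vote is an assumption, since the text does not state a tie-breaking rule.

```python
def weighted_vote(labels, weights):
    """Return (y_bar, y_hat) for one task, following Eqn. (1).

    labels[j]  : label of worker j in {-1, 0, +1}; 0 marks a missing label.
    weights[j] : positive weight of worker j for the context of this task.
    """
    total = sum(weights)
    # Missing labels contribute 0 to the numerator but still count in the
    # denominator, so |y_bar| reaches 1 only when all workers agree.
    y_bar = sum(w * y for w, y in zip(weights, labels)) / total
    # sign(0) is kept as 0 here (assumption: no tie-breaking rule is given).
    y_hat = (y_bar > 0) - (y_bar < 0)
    return y_bar, y_hat
```

For example, with labels `[1, -1, 0]` and weights `[2.0, 1.0, 1.0]`, the weighted vote is 0.25 and the aggregated label is +1.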


Our proposed method consists of the pure exploration phase and the adaptive assignment
phase. The pseudo code is given in Algorithm 1, and the details are explained
below.

Algorithm 1 Bandit-Based Task Assignment (BBTA)

 1: Input:
 2:   $N$: The number of tasks, each of which has a context $s$ from $\mathcal{S}$
 3:   $K$: The number of workers
 4:   $T$: The total budget
 5:   $N'$: The number of tasks for each context in the pure exploration phase
 6: Initialization:
 7:   $y_{i,j} = 0$ for $i = 1, \ldots, N$ and $j = 1, \ldots, K$.
 8:   $|\bar{y}_i| = 0$ for $i = 1, \ldots, N$.
 9:   $t_s = 0$ for $s \in \mathcal{S}$.
10: % Pure Exploration Phase
11: Pick $N'$ tasks for each of $S$ distinct contexts.
12: Collect labels from all workers for these tasks and use Eqn. (2) for label aggregation.
13: Calculate cumulative losses for $j = 1, \ldots, K$ and $s \in \mathcal{S}$: $\tilde{L}^s_{j,0} = \sum_{i \in \mathcal{I}_1^s} \mathbb{1}_{y_{i,j} \neq \hat{y}_i}$.
14: Set the budget as $T_2 = T - SKN'$ and the index set of available tasks as $\mathcal{I}_2 = \mathcal{I} \setminus \bigcup_{s \in \mathcal{S}} \mathcal{I}_1^s$ for the adaptive assignment phase.
15: % Adaptive Assignment Phase
16: for $t = 1$ to $T_2$ do
17:   Pick the task with the lowest confidence score: $i_t = \operatorname*{argmin}_{i \in \mathcal{I}_2} |\bar{y}_i|$.
18:   Observe its context $s$, and update $t_s \leftarrow t_s + 1$.
19:   Set $\eta^s_{t_s} = \sqrt{\frac{\ln K}{t_s K}}$ and calculate the weights for $j = 1, \ldots, K$: $w^s_{j,t_s} = \exp\left(-\eta^s_{t_s} \tilde{L}^s_{j,t_s-1}\right)$.
20:   Draw the worker $j_t$ from the distribution $(p_{1,t}, \ldots, p_{K,t})$, where $p_{j,t} = \frac{w^s_{j,t_s}}{\sum_{j'=1}^{K} w^s_{j',t_s}}$.
21:   Obtain label $y_{i_t,j_t}$ and calculate $\hat{y}_{i_t}$ by the weighted voting mechanism (Eqn. (1)).
22:   Set $l_{j_t,t} = \mathbb{1}_{y_{i_t,j_t} \neq \hat{y}_{i_t}}$, and for $j = 1, \ldots, K$ calculate $\tilde{l}_{j,t} = \frac{l_{j,t}}{p_{j,t}} \mathbb{1}_{j = j_t}$ and update $\tilde{L}^s_{j,t_s} = \tilde{L}^s_{j,t_s-1} + \tilde{l}_{j,t}$.
23:   Update confidence scores $|\bar{y}_i|$ of all tasks with context $s$ by using Eqn. (1).
24:   if task $i_t$ is already labeled by all workers then
25:     $\mathcal{I}_2 \leftarrow \mathcal{I}_2 \setminus \{i_t\}$.
26:   end if
27: end for
28: Output: $\hat{y}_i = \mathrm{sign}(\bar{y}_i)$ for $i = 1, \ldots, N$.
2.2 Pure Exploration Phase

Pure exploration is performed in a batch mode, and the purpose is to learn which workers
are reliable at which labeling tasks. To this end, we pick $N'$ tasks for each of $S$ distinct
contexts ($SN' \ll N$) and let all $K$ workers label them (Lines 11-12 in Algorithm 1). We
denote the index set of the $N'$ tasks with context $s$ as $\mathcal{I}_1^s$ in this phase.

Since we have no prior knowledge of workers’ reliability in this phase, we give all of 
them the same weight when aggregating their labels (equivalent to majority voting): 


$$\hat{y}_i = \mathrm{sign}(\bar{y}_i), \quad \text{where} \quad \bar{y}_i = \frac{1}{K} \sum_{j=1}^{K} y_{i,j}. \tag{2}$$

In the standard crowdsourcing scenario, all of the true labels are usually unknown. As in
many other crowdsourcing methods, we have the prior belief that most workers perform
reasonably well. Evaluating an individual label against the weighted vote $\hat{y}_i$ is therefore a
common solution (Donmez et al., 2009; Ertekin et al., 2014). We denote the cumulative
loss by $\tilde{L}^s_{j,0}$ and initialize it for the next phase as

$$\tilde{L}^s_{j,0} = \sum_{i \in \mathcal{I}_1^s} \mathbb{1}_{y_{i,j} \neq \hat{y}_i}, \quad \text{for } j = 1, \ldots, K \text{ and } s \in \mathcal{S},$$

where $\mathbb{1}_{\pi}$ denotes the indicator function that outputs 1 if condition $\pi$ holds and 0 otherwise.
This means that when a worker gives an individual label $y_{i,j}$ inconsistent or
consistent with the weighted vote $\hat{y}_i$, this worker suffers a loss of 1 or 0, respectively.
Cumulative losses thus reflect workers’ reliability: the smaller the loss, the more reliable
the worker. They are used for calculating
workers’ weights in the next phase. The budget for the next phase is $T_2 = T - T_1$, where
$T_1 = SKN' < T$ is the budget consumed in this phase.
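The initialization of cumulative losses in the pure exploration phase can be sketched as follows. This is an illustrative sketch only: the function name `initial_cumulative_losses` and the dictionary-of-lists input are hypothetical, and charging every worker a loss of 1 when the majority vote is tied is an assumption of the sketch.

```python
def initial_cumulative_losses(labels_by_context):
    """Count, per context and worker, disagreements with the majority vote.

    labels_by_context[s] is a list of tasks for context s; each task is a
    list of K labels in {-1, +1}, one per worker (all workers label all
    exploration tasks).
    """
    losses = {}
    for s, tasks in labels_by_context.items():
        num_workers = len(tasks[0])
        losses[s] = [0] * num_workers
        for task_labels in tasks:
            y_bar = sum(task_labels) / num_workers   # unweighted vote, Eqn. (2)
            y_hat = (y_bar > 0) - (y_bar < 0)        # sign; a tie gives 0
            for j, y in enumerate(task_labels):
                # On a tie (y_hat == 0) every worker is charged a loss of 1;
                # this tie handling is an assumption of the sketch.
                losses[s][j] += int(y != y_hat)
    return losses
```

With $K = 3$ workers, two "sports" tasks labeled `[1, 1, -1]` and `[-1, -1, -1]` give losses `[0, 0, 1]` for that context: only the third worker disagreed with a majority vote.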


2.3 Adaptive Assignment Phase 

In the adaptive assignment phase, task-worker assignment is determined for the remaining
$N - SN'$ tasks in an online mode within the remaining budget $T_2$. At each step $t$ of this
phase, to determine a task-worker pair, we need to further consider which task to pick
and which worker to select for this task.

According to the weighted voting mechanism (Eqn. (1)), the magnitude $|\bar{y}_i|$ ($\in [0, 1]$)
corresponds to the confidence score of $\hat{y}_i$. For task $i$, the confidence score $|\bar{y}_i|$ will be 1
if and only if we have collected labels from all workers and all of them are consistent.
On the other hand, when the sum of (normalized) weights for positive labels is equal to
that for negative labels, or we have no labels for task $i$, the confidence score $|\bar{y}_i|$ is 0.
That is, we are not confident in the aggregated label (i.e. the confidence score is low)
because the collected labels are significantly inconsistent, or insufficient, or both. If the
confidence score of a task is lower than those of others, collecting more labels for this
task is a reasonable solution. Thus we pick task $i_t$ with the lowest confidence score as
the next one to label (Line 17 in Algorithm 1):

$$i_t = \operatorname*{argmin}_{i \in \mathcal{I}_2} |\bar{y}_i|,$$

where $\mathcal{I}_2$ is the index set of currently available tasks in this phase. This idea is similar to
uncertainty sampling (Lewis and Gale, 1994) in active learning.
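The task-selection rule can be sketched in a few lines of Python. The name `pick_least_confident` and the dictionary of confidence scores are hypothetical, and breaking ties in favor of the smallest task index is an assumption, since the text does not say how ties are resolved.

```python
def pick_least_confident(confidence, available):
    """Return the available task index with the smallest confidence score.

    confidence[i] holds |y_bar_i|; available is the set of task indices that
    have not yet been labeled by all workers.
    """
    # sorted() makes tie-breaking deterministic: lowest index wins (assumption).
    return min(sorted(available), key=lambda i: confidence[i])
```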

Given the picked task, selecting a worker reliable at this task is always favored. On 
the other hand, workers’ reliability is what we are dynamically learning in the method. 
There exists a trade-off between exploration (i.e. learning which worker is reliable) and 
exploitation (i.e. selecting the worker considered to be reliable) in worker selection. To 
address this trade-off, we formulate our task assignment problem as a multi-armed bandit
problem, more specifically, a contextual bandit problem (Bubeck and Cesa-Bianchi, 2012).

Multi-armed bandit problems (Auer et al., 2002a,b) are basic examples of sequential
decision making with limited feedback and can naturally handle the exploration-exploitation
trade-off. At each step, we allocate one unit of budget to one of a set of
actions and obtain some observable reward (loss) given by the environment. The goal is
to maximize (minimize) the cumulative reward (loss) in the whole allocation sequence.
To achieve this goal, we must balance the exploitation of actions that were good in the past
and the exploration of actions that may be better in the future. A practical extension
of the basic bandit problem is the contextual bandit problem, where each step is marked 
by a context from a given set. Then the interest is finding good mappings from contexts 
to actions rather than identifying good actions. 

In most contextual bandits, the context (side information) is provided as a feature
vector which influences the reward/loss at each step, whereas in the heterogeneous crowdsourcing
setting, we consider a simple case of contextual bandits, where the context
is marked by some discrete value, corresponding to the task type. Then our interest is to
find a good mapping from task types to appropriate workers rather than the best worker 
for all tasks. Note that the task types are observable and provided by the environment. 

The adaptive assignment phase includes our strategy of worker selection, where selecting
a worker corresponds to taking an action. The objective is to collect reliable labels
via the strategy of adaptively selecting workers from the worker pool.

We further treat the scheme that each worker adopts to assign labels as a black box.
Then the labeled status of all tasks could abruptly change over time. This means the 
exact sequence of tasks in the adaptive assignment phase is unpredictable, given that we 
calculate confidence scores for all tasks based on their labeled status. Thus we consider 
the task sequence as the external information in this phase. 

Given the task with context $s$, we calculate the ($s$-dependent) weights of workers as
follows (Line 19 in Algorithm 1):

$$w^s_{j,t_s} = \exp\left(-\eta^s_{t_s} \tilde{L}^s_{j,t_s-1}\right), \quad \text{for } j = 1, \ldots, K,$$

where $t_s$ is the appearance count of context $s$ and $\eta^s_{t_s}$ is the learning rate related to $t_s$.
This calculation of weights by using cumulative losses is due to the exponential weighting
scheme (Cesa-Bianchi and Lugosi, 2006; Arora et al., 2012), which is a standard tool for
sequential decision making under adversarial assumptions. Following the exponential
weighting scheme, we then select a worker $j_t$ from the discrete probability distribution on
workers with each probability $p_{j,t}$ proportional to the weight $w^s_{j,t_s}$ (Line 20 in Algorithm 1).
Since workers’ cumulative losses do not greatly vary at the beginning but gradually differ
from each other through the whole adaptive assignment phase, this randomized strategy
can balance the exploration-exploitation trade-off in worker selection, by exploring more
workers at earlier steps and doing more exploitation at later steps.
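A minimal Python sketch of the exponential weighting step is given below. The function names are hypothetical, and the learning rate is taken as $\sqrt{\ln K / (t_s K)}$, which is an assumption of this sketch (a common anytime schedule for exponential weighting). Subtracting the minimum loss before exponentiating is a numerical-stability device that leaves the normalized probabilities unchanged.

```python
import math
import random


def worker_distribution(cum_losses, t_s):
    """Selection probabilities over workers for one context.

    cum_losses[j] : cumulative (estimated) loss of worker j on this context.
    t_s           : appearance count of the context, t_s >= 1.
    """
    K = len(cum_losses)
    eta = math.sqrt(math.log(K) / (t_s * K))   # assumed learning-rate schedule
    base = min(cum_losses)                     # shift to avoid underflow in exp
    weights = [math.exp(-eta * (loss - base)) for loss in cum_losses]
    total = sum(weights)
    return [w / total for w in weights]


def draw_worker(probs, rng=random):
    """Sample a worker index from the distribution probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With equal cumulative losses the distribution is uniform (pure exploration); as the losses separate, probability mass shifts toward the workers with the smallest loss.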
Then we ask worker $j_t$ for an individual label $y_{i_t,j_t}$ and calculate the weighted vote
$\hat{y}_{i_t}$ by using Eqn. (1) (Line 21 in Algorithm 1). With the weighted vote $\hat{y}_{i_t}$, we obtain
the loss of the selected worker $j_t$: $l_{j_t,t} = \mathbb{1}_{y_{i_t,j_t} \neq \hat{y}_{i_t}}$ (Line 22 in Algorithm 1). Recall that
we do not make any assumption on how each worker gives a label. Then we have no
stochastic assumption on the generation of losses. Thus we can consider this problem as
an adversarial bandit problem. Note that we can only observe the loss of the selected
worker $j_t$, and for other workers, the losses are unobserved. This is called limited feedback
in bandit problems. Here, we use an unbiased estimate of the loss for any worker
from the above distribution:

$$\tilde{l}_{j,t} = \frac{l_{j,t}}{p_{j,t}} \mathbb{1}_{j = j_t}.$$

It is easy to see that the expectation of $\tilde{l}_{j,t}$ with respect to the selection of worker $j_t$ is
exactly the loss $l_{j,t}$. Finally, we update the cumulative losses and the confidence scores
of tasks with the same context as the current one (Lines 22-23 in Algorithm 1).
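The unbiasedness of the loss estimate can be verified in one line. Holding the loss $l_{j,t}$ fixed and taking the expectation over the draw $j_t \sim p_t$ (every $p_{j,t}$ is strictly positive because the weights are exponentials):

```latex
\mathbb{E}_{j_t \sim p_t}\bigl[\tilde{l}_{j,t}\bigr]
  = \sum_{j'=1}^{K} p_{j',t} \, \frac{l_{j,t}}{p_{j,t}} \, \mathbb{1}_{j = j'}
  = p_{j,t} \, \frac{l_{j,t}}{p_{j,t}}
  = l_{j,t}.
```

Only the term $j' = j$ survives the indicator, so the division by $p_{j,t}$ exactly compensates for how rarely worker $j$ is observed.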

The above assignment step is repeated $T_2$ times until the budget is used up.


3 Theoretical Analysis 


In this section, we theoretically analyze the proposed bandit-based task assignment 
(BBTA) method. 

The behavior of a bandit strategy is studied by means of regret analysis. Usually, 
the performance of a bandit strategy is compared with that of the optimal one, to show 
the “regret” for not following the optimal strategy. In our task assignment problem, we 
use the notion of regret to investigate how well the proposed strategy can select better 
workers from the whole worker pool, by comparing our strategy with the optimal one. 

The reason why we use regret as the evaluation measure is that the whole task assignment
process works in an online mode. From the perspective of a requester,
the objective is to maximize the average accuracy of the estimated true labels under the
budget constraint. Ideally, this is possible when we have complete knowledge of the
whole process, e.g. the reliability of each worker and the budget for each context. However,
in the setting of task assignment, since we cannot know beforehand any information
about the worker behaviors and the coming contexts in the future, it is not meaningful
to try to maximize the average accuracy directly. Instead, the notion of regret can be used as an
evaluation measure for the strategy of worker selection in the proposed method, which
evaluates a relative performance loss compared with the optimal strategy. As a general
objective for online problems (Hazan, 2011; Shalev-Shwartz, 2012), minimizing the regret
is a common and reasonable approach to guaranteeing the performance.

Specifically, we define the regret by

$$R_T = \max_{g: \mathcal{S} \to \{1, \ldots, K\}} \mathbb{E}\left[\sum_{t=1}^{T_2} \left(l_{j_t,t} - l_{g(s),t}\right)\right],$$

where $g: \mathcal{S} \to \{1, \ldots, K\}$ is the mapping from contexts to workers, $s$ is the context of
the task at step $t$, and $T_2 = T - SKN'$ is the budget for the adaptive assignment phase.
The regret we defined is the difference of losses between our strategy and the expected
optimal one. Note that the optimal strategy here is assumed to be obtained in hindsight
under the same sequence of contexts as that of our strategy. The expectation is taken
with respect to both the randomness of loss generation and worker selection $j_t \sim p_t$,
where $p_t$ is the current worker distribution at step $t$ of our strategy. Since we consider
the loss generation as oblivious (i.e. the loss $l_{j_t,t}$ only depends on the current step $t$), we
do not take the expectation over all sample paths.

We further denote $\mathcal{T}_2^s$ as the set of indices for steps with context $s$, and $T_2^s = |\mathcal{T}_2^s|$ as
the total appearance count of context $s$ in the adaptive assignment phase. The following
theorem shows the regret bound of BBTA.


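Theorem 1. The regret of BBTA with $\eta^s_{t_s} = \sqrt{\frac{\ln K}{t_s K}}$ is bounded as

$$R_T \le 2\sqrt{(T - SKN') SK \ln K} + \frac{SN'^2}{4}\sqrt{\frac{\ln K}{K}} + SN'.$$

Proof.

$$\begin{aligned}
R_T &= \max_{g: \mathcal{S} \to \{1, \ldots, K\}} \mathbb{E}\left[\sum_{t=1}^{T_2} \left(l_{j_t,t} - l_{g(s),t}\right)\right] \\
&= \sum_{s \in \mathcal{S}} \max_{j = 1, \ldots, K} \mathbb{E}\left[\sum_{t \in \mathcal{T}_2^s} \left(l_{j_t,t} - l_{j,t}\right)\right] \\
&\le \sum_{s \in \mathcal{S}} \left(2\sqrt{T_2^s K \ln K} + \frac{N'^2}{4}\sqrt{\frac{\ln K}{K}} + N'\right) \\
&\le 2\sqrt{(T - SKN') SK \ln K} + \frac{SN'^2}{4}\sqrt{\frac{\ln K}{K}} + SN',
\end{aligned}$$

where the first inequality is implied by Lemma 2 shown below and the second one is due
to Jensen’s inequality and $\sum_{s \in \mathcal{S}} T_2^s = T_2$. □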
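The Jensen step in the proof can be spelled out as follows. Because the square root is concave, the average of the square roots over the $S$ contexts is at most the square root of the average:

```latex
\sum_{s \in \mathcal{S}} \sqrt{T_2^s}
  = S \cdot \frac{1}{S} \sum_{s \in \mathcal{S}} \sqrt{T_2^s}
  \le S \sqrt{\frac{1}{S} \sum_{s \in \mathcal{S}} T_2^s}
  = \sqrt{S \, T_2}
  = \sqrt{S \, (T - SKN')}.
```

Multiplying both sides by $2\sqrt{K \ln K}$ gives the first term of the final bound; the remaining terms do not depend on $s$ and are simply summed $S$ times.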

Now, let us consider the following setting. Suppose there is only one context ($S = 1$)
for all tasks (i.e. the homogeneous setting). Then we do not have to distinguish
different contexts. In particular, $t_s$, $\eta^s_{t_s}$, $\tilde{L}^s_{j,t_s}$, $w^s_{j,t_s}$ and $T_2^s$ are respectively equivalent to
$t$, $\eta_t$, $\tilde{L}_{j,t}$, $w_{j,t}$ and $T_2$ in this setting. For convenience, we omit the superscript $s$ in this
setting. The regret now is

$$R'_T = \max_{j = 1, \ldots, K} \mathbb{E}\left[\sum_{t=1}^{T_2} \left(l_{j_t,t} - l_{j,t}\right)\right],$$

which is the difference of losses between our strategy and the one that always selects the
best worker in expectation. The following Lemma 2 shows a bound of $R'_T$.

Lemma 2. If there is only one context, then the regret of BBTA with $\eta_t = \sqrt{\frac{\ln K}{tK}}$ is
bounded as

$$R'_T \le 2\sqrt{(T - KN') K \ln K} + \frac{N'^2}{4}\sqrt{\frac{\ln K}{K}} + N'.$$

The proof of Lemma 2 is provided in the Appendix.

Remark 1. Theorem 1 and Lemma 2 show sub-linear regret bounds with respect to $T$ for
BBTA, indicating that the performance of our strategy converges to that of the expected
optimal one as the budget $T$ goes to infinity.

Remark 2. The pure exploration phase is necessary for obtaining some knowledge of
workers’ reliability before adaptively assigning tasks, and $N'$ controls the length of the
pure exploration phase. In Theorem 1, it is easy to see that a larger $N'$ makes the term
$2\sqrt{(T - SKN') SK \ln K}$ (corresponding to the adaptive assignment phase) smaller but
results in a larger $\frac{SN'^2}{4}\sqrt{\frac{\ln K}{K}} + SN'$ (related to the pure exploration phase). Moreover, a
longer pure exploration phase also consumes a larger proportion of the total budget. The
pure exploration phase is effective (as we will see in Section 5), but how effective it can
be is not only related to the choice of $N'$ (even if we could obtain the optimal $N'$ that
minimizes the bound), but also depends on many other factors, such as the true accuracy
of workers and how different their labeling performances are from each other, which are
unfortunately unknown in advance. A practical choice used in our experiments in Section 5
is to set a very small $N'$ (e.g. $N' = 1$).
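The trade-off described in Remark 2 can be made concrete by evaluating the two parts of the bound numerically. The sketch below assumes the bound reads $2\sqrt{(T - SKN')SK \ln K} + \frac{SN'^2}{4}\sqrt{\ln K / K} + SN'$; the function name and argument names are hypothetical.

```python
import math


def bbta_bound_terms(T, S, K, n_explore):
    """Return (adaptive_term, exploration_term) of the assumed regret bound.

    T         : total budget
    S         : number of contexts
    K         : number of workers
    n_explore : N', tasks per context labeled by all workers in exploration
    """
    T2 = T - S * K * n_explore                 # budget left after exploration
    if T2 < 0:
        raise ValueError("pure exploration alone would exceed the budget")
    adaptive = 2.0 * math.sqrt(T2 * S * K * math.log(K))
    exploration = (S * n_explore ** 2 / 4.0) * math.sqrt(math.log(K) / K) + S * n_explore
    return adaptive, exploration
```

For instance, with $T = 1000$, $S = 3$ and $K = 30$, moving from $N' = 1$ to $N' = 2$ lowers the first term and raises the second, in line with the remark.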


4 Review of Existing Task Assignment Methods for 
Crowdsourcing 

In this section, we review the existing task assignment methods for crowdsourcing and 
discuss similarities to and differences from the proposed method. 


4.1 IEThresh 


IEThresh (Donmez et al., 2009) builds upon interval estimation (IE) (Kaelbling, 1990),
which is used in reinforcement learning for handling the exploration-exploitation trade-off
in action selection. Before each selection, the IE method estimates the upper confidence
interval (UI) for each action $a$ according to its cumulative reward:

$$UI(a) = m(a) + t^{(n-1)}_{\alpha/2} \frac{s(a)}{\sqrt{n}},$$

where $m(a)$ and $s(a)$ are the mean and standard deviation of rewards that have been
received so far by choosing action $a$, $n$ denotes the number of times $a$ has been selected so
far, and $t^{(n-1)}_{\alpha/2}$ is the critical value for Student’s $t$-distribution with $n - 1$ degrees of freedom
at significance level $\alpha/2$. In experiments, we set $\alpha$ at 0.05, following Donmez et al. (2009).

In IEThresh, taking the action $a_j$ corresponds to selecting worker $j$ to ask for a label.
The reward is 1 if the worker’s label agrees with the majority vote, and 0 otherwise. At
each step, a new task arrives, UI scores of all workers are re-calculated, and a subset of
workers with indices

$$\{j \mid UI(a_j) \ge \epsilon \cdot \max_{a} UI(a)\}$$

are asked for labels. It is easy to see that the pre-determined threshold parameter $\epsilon$
controls the size of the worker subset at each step.
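The UI score and the threshold rule can be sketched as below. This is an illustration, not the reference implementation of IEThresh: the function names are hypothetical, the critical value of the $t$-distribution is passed in by the caller rather than computed, and at least two observed rewards per worker are assumed so that the sample standard deviation is defined.

```python
import math
import statistics


def ui_score(rewards, t_crit):
    """Upper confidence interval m(a) + t_crit * s(a) / sqrt(n) for one worker.

    rewards : list of 0/1 rewards received so far (needs len >= 2).
    t_crit  : Student-t critical value for n - 1 degrees of freedom,
              looked up by the caller for the chosen significance level.
    """
    n = len(rewards)
    return statistics.mean(rewards) + t_crit * statistics.stdev(rewards) / math.sqrt(n)


def select_workers(ui_scores, eps):
    """Indices j whose UI score is at least eps times the best UI score."""
    cutoff = eps * max(ui_scores)
    return [j for j, ui in enumerate(ui_scores) if ui >= cutoff]
```

A larger `eps` shrinks the selected subset toward the single best-scoring worker, while a smaller `eps` admits more workers and therefore spends more budget per task.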
4.2 CrowdSense

Similarly to IEThresh, CrowdSense (Ertekin et al., 2014) also dynamically samples worker
subsets in an online mode. The criterion that CrowdSense uses for handling exploration-
exploitation in worker selection is described as follows.

Before assigning a new task $i$ to workers, CrowdSense first calculates a quality estimate
for each worker:

$$Q_j = \frac{a_j + K}{c_j + 2K},$$

where $c_j$ is the number of times worker $j$ has given a label, $a_j$ represents how many of
those labels were consistent with the weighted vote, and $K$ is here a smoothing parameter
(not the number of workers). In
experiments, we set this smoothing parameter at 100, which is empirically shown by Ertekin et al. (2014) to be the
best choice. Then a confidence score is calculated as

$$\mathrm{Score}(S_i) = \sum_{j \in S_i} y_{i,j} Q_j,$$

where $S_i$ is a subset of workers initially containing two top-quality workers and another
random worker. Then CrowdSense determines whether to add a new worker $l$ to $S_i$
according to the criterion

$$\frac{|\mathrm{Score}(S_i)| - Q_l}{|S_i| + 1} < \epsilon,$$

where $\epsilon$ is a pre-determined threshold parameter. If the above is true, the candidate
worker is added to $S_i$, and $\mathrm{Score}(S_i)$ is re-calculated. After fixing the subset of workers
for task $i$, CrowdSense aggregates the collected labels by the weighted voting mechanism,
using $Q_j$ as the weight of worker $j$.
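The two CrowdSense quantities can be sketched as follows. The function names are hypothetical and the sketch covers only the quality estimate and the add-a-worker test, not the full selection loop.

```python
def quality(agreements, labels_given, smoothing=100):
    """Smoothed quality estimate (a_j + smoothing) / (c_j + 2 * smoothing)."""
    return (agreements + smoothing) / (labels_given + 2 * smoothing)


def should_add_worker(score, q_candidate, subset_size, eps):
    """True if (|Score(S_i)| - Q_l) / (|S_i| + 1) < eps.

    score       : current Score(S_i), the quality-weighted sum of labels
    q_candidate : quality estimate Q_l of the candidate worker l
    subset_size : current number of workers |S_i|
    """
    return (abs(score) - q_candidate) / (subset_size + 1) < eps
```

A worker with no history gets quality 0.5, and the smoothing keeps early estimates close to 0.5 until enough labels accumulate. The test adds a worker whenever the candidate's vote could still noticeably move the current score.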


4.3 Optimistic Knowledge Gradient 


Optimistic Knowledge Gradient (OptKG) (Chen et al., 2013) uses an $N$-coin-tossing
model, formulates the task assignment problem as a Bayesian Markov decision process
(MDP), and obtains the optimal allocation sequence for any finite budget $T$ via a computationally
efficient approximate policy.

For better illustration, a simplified model with noiseless workers is first considered.
In the model, task $i$ is characterized by $\theta_i$ drawn from a known Beta prior distribution,
$\mathrm{Beta}(a_i^0, b_i^0)$. In practice, we may simply start from $a_i^0 = b_i^0 = 1$ if there is no prior
knowledge (Chen et al., 2013). At each stage $t$ with $\mathrm{Beta}(a_i^t, b_i^t)$ as the current posterior
distribution for $\theta_i$, task $i_t$ is picked, and its label, drawn from the Bernoulli distribution
$y_{i_t} \sim \mathrm{Bernoulli}(\theta_{i_t})$, is acquired. Then $\{a_i^t, b_i^t\}_{i=1}^{n}$ are put into an $n \times 2$ matrix $S^t$, called
a state matrix with row $i$ as $S_i^t = (a_i^t, b_i^t)$, where $n$ denotes the number of tasks. The state matrix is updated according to the
following rule:

$$S^{t+1} = \begin{cases} S^t + (e_{i_t}, \mathbf{0}) & \text{if } y_{i_t} = 1, \\ S^t + (\mathbf{0}, e_{i_t}) & \text{if } y_{i_t} = -1, \end{cases}$$

where $e_{i_t}$ is an $n$-dimensional vector with 1 at the $i_t$-th entry and 0 at all others. The
state transition probability is calculated as

$$p(y_{i_t} = 1 \mid S^t, i_t) = \mathbb{E}[\theta_{i_t} \mid S^t] = \frac{a^t_{i_t}}{a^t_{i_t} + b^t_{i_t}}.$$

The stage-wise expected reward is defined as

$$R(S^t, i_t) = p_1 R_1(a^t_{i_t}, b^t_{i_t}) + p_2 R_2(a^t_{i_t}, b^t_{i_t}),$$

where

$$\begin{aligned}
p_1 &= p(y_{i_t} = 1 \mid S^t, i_t), \\
p_2 &= 1 - p(y_{i_t} = 1 \mid S^t, i_t), \\
R_1(a^t_{i_t}, b^t_{i_t}) &= h(I(a^t_{i_t} + 1, b^t_{i_t})) - h(I(a^t_{i_t}, b^t_{i_t})), \\
R_2(a^t_{i_t}, b^t_{i_t}) &= h(I(a^t_{i_t}, b^t_{i_t} + 1)) - h(I(a^t_{i_t}, b^t_{i_t})), \\
I(a^t_{i_t}, b^t_{i_t}) &= P(\theta \ge 0.5 \mid \theta \sim \mathrm{Beta}(a^t_{i_t}, b^t_{i_t})), \\
h(x) &= \max(x, 1 - x).
\end{aligned}$$

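Then the following optimization problem is solved:

$$V(S^0) = G_0(S^0) + \sup_{\pi} \mathbb{E}^{\pi}\left[\sum_{t=0}^{T-1} R(S^t, i_t)\right],$$

where $G_0(S^0) = \sum_{i=1}^{n} h(I(a_i^0, b_i^0))$. With the above definitions, the problem is formulated
as a $T$-stage MDP associated with a tuple

$$\{T, \{\mathcal{S}^t\}, \mathcal{A}, p(y_{i_t} \mid S^t, i_t), R(S^t, i_t)\},$$

where

$$\mathcal{S}^t = \left\{\{a_i^t, b_i^t\}_{i=1}^{n} : a_i^t \ge a_i^0,\ b_i^t \ge b_i^0,\ \sum_{i=1}^{n} \left[(a_i^t - a_i^0) + (b_i^t - b_i^0)\right] = t\right\}$$

and the action space is the set of indices for tasks that could be labeled next: $\mathcal{A} = \{1, \ldots, n\}$.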

Since the computation of dynamic programming for obtaining the optimal policy of the
above formulation is intractable, a computationally efficient and consistent approximate
policy is employed, resulting in the criterion for task selection at each stage $t$:

$$i_t = \operatorname*{argmax}_{i \in \{1, \ldots, n\}} \left(\max\left(R_1(a_i^t, b_i^t), R_2(a_i^t, b_i^t)\right)\right).$$

Finally, OptKG outputs the positive set

$$H_T = \{i : a_i^T \ge b_i^T\}.$$
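The task-selection criterion of the simplified (noiseless-worker) model can be sketched in Python. The sketch assumes integer Beta parameters, for which $P(\theta \ge 0.5)$ reduces to a binomial sum and needs no special-function library; the function names are hypothetical and the code is not taken from the OptKG paper.

```python
from math import comb


def prob_positive(a, b):
    """I(a, b) = P(theta >= 0.5) for theta ~ Beta(a, b), integers a, b >= 1."""
    n = a + b - 1
    # For integer parameters, P(theta <= x) = P(Binomial(n, x) >= a), hence
    # P(theta >= 0.5) = P(Binomial(n, 0.5) <= a - 1).
    return sum(comb(n, k) for k in range(a)) / 2 ** n


def h(x):
    """Accuracy of the better of the two possible decisions."""
    return max(x, 1.0 - x)


def optimistic_gain(a, b):
    """max(R_1, R_2): the larger one-step reward over the two label outcomes."""
    base = h(prob_positive(a, b))
    r1 = h(prob_positive(a + 1, b)) - base   # reward if the next label is +1
    r2 = h(prob_positive(a, b + 1)) - base   # reward if the next label is -1
    return max(r1, r2)


def pick_task_optkg(state):
    """state[i] = (a_i, b_i); return the index with the largest optimistic gain."""
    return max(range(len(state)), key=lambda i: optimistic_gain(*state[i]))
```

For example, a task still at the uniform prior $(1, 1)$ has optimistic gain 0.25, whereas a task at $(3, 1)$ has gain 0.0625, so the former is selected.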


Based on the simplified model above, workers’ reliability can be further introduced.
Specifically, worker $j$ is characterized by a scalar parameter

$$\rho_j = p(y_{i,j} = y_i^* \mid y_i^*),$$

and is assumed drawn from a Beta prior distribution: $\rho_j \sim \mathrm{Beta}(c_j^0, d_j^0)$. Then at each
stage, a pair of worker and task is adaptively selected, according to a new criterion, which
is derived in Appendix 5 of the original OptKG paper (Chen et al., 2013). In experiments,
we follow Chen et al. (2013) and set $c_j^0 = 4$ and $d_j^0 = 1$ for each worker, which indicates
that we have the prior belief that the average accuracy of workers is $4/5 = 80\%$.


4.4 Discussion 

Although IEThresh and CrowdSense employ different criteria for dealing with the
exploration-exploitation trade-off in worker selection, they share a similar mechanism
of dynamically sampling worker subsets. In particular, at each step, a new task comes,
workers’ reliability is learned according to their labeling performances on previous tasks,
and a subset of workers considered to be reliable is sampled for labeling the new task.
However, in heterogeneous crowdsourcing, a worker in the subset who is reliable at previous
tasks may be bad at new ones. On the other hand, a worker who is good at new
tasks may have already been eliminated from the subset. Therefore, it is reasonable to
model the task-dependent reliability of workers, and then match each task to the workers
who can do it best.

Another issue in IEThresh and CrowdSense is that the exact amount of total budget
is not pre-fixed. In these two methods, a threshold parameter $\epsilon$ is used for controlling
the size of worker subsets. That is, $\epsilon$ determines the total budget. However, the number of
workers in each subset is unknown beforehand. Therefore, we will not know the exact
amount of total budget until the whole task assignment process ends. This is not so
practical in crowdsourcing, since in the task assignment problem, we pursue not only
collecting reliable labels but also intelligently using the pre-fixed budget. Moreover, both
IEThresh and CrowdSense lack theoretical analyses of the relation between the
budget and the performance.

OptKG and the proposed method (BBTA) attempt to intelligently use a pre-fixed budget,
and ask one worker for a label of the current task at each step. OptKG formulates
the task assignment problem as a Bayesian MDP, and is proved to produce a consistent
policy in the homogeneous worker setting (i.e. the policy will achieve 100% accuracy
almost surely when the total budget goes to infinity). However, when the heterogeneous
reliability of workers is introduced, a more sophisticated approximation is also involved
in OptKG, making it difficult to give a theoretical analysis for OptKG in the heterogeneous
worker setting. On the other hand, BBTA is a contextual bandit formulation designed
for heterogeneous crowdsourcing, and the regret analysis demonstrates that the performance
of our task assignment strategy will converge to that of the optimal one when the total
budget goes to infinity.


5 Experiments 

In this section, we experimentally evaluate the usefulness of the proposed bandit-based 
task assignment (BBTA) method. To compare BBTA with existing methods, we first 
conduct experiments on benchmark data with simulated workers, and then use real data 
for further comparison. All of the experimental results are averaged over 30 runs. 

(a) N = 351, K = 30, S = 3    (b) N = 569, K = 40, S = 4    (c) N = 768, K = 50, S = 5

Figure 1: Distribution of true accuracy of simulated workers for three benchmark datasets
with three worker models.


5.1 Benchmark Data 

We perform experiments on three popular UCI benchmark datasets^2: ionosphere (N =
351), breast (N = 569), and pima (N = 768). We consider instances in these datasets as
labeling tasks in crowdsourcing. True labels of all tasks in these datasets are available.
To simulate various heterogeneous cases in the real world, we first use k-means to cluster
these three datasets into S = 3, 4, 5 subsets respectively (corresponding to different
contexts). Since there are no crowd workers in these datasets, we then simulate workers
(K = 30, 40, 50, respectively) by using the following worker models in the heterogeneous
setting:


Spammer-Hammer Model 


A hammer gives true labels, while a spammer gives random labels (Karger et al.,
2011). We introduce this model into the heterogeneous setting: each worker is a hammer
on one subset of tasks but a spammer on others.


One-Coin Model 

Each worker gives true labels with a given probability (i.e. accuracy). This model
is widely used in the existing crowdsourcing literature (e.g. Raykar et al., 2010;
Chen et al., 2013) for simulating workers. We use this model in the heterogeneous
setting: each worker gives true labels with higher accuracy (we set it to 0.9) on one
subset of tasks, but with lower accuracy (we set it to 0.6) on others.


One-Coin Model (Malicious) 

This model is based on the previous one, except that we add more malicious labels: 
each worker is good at one subset of tasks (accuracy: 0.9), malicious or bad at 
another one (accuracy: 0.3), and normal at the rest (accuracy: 0.6). 
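
The three worker models above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' simulation code: labels are assumed to be binary (+1/-1), a spammer is modeled as answering correctly with probability 0.5, and the context each worker is good (or malicious) at is drawn uniformly at random. The function names `accuracy_table`, `simulate_labels`, and `true_accuracy` are hypothetical.

```python
import numpy as np

def accuracy_table(model, n_workers, n_contexts, rng):
    """Per-worker, per-context probability of giving the true label."""
    rows = np.arange(n_workers)
    # Assumption: the context a worker is good at is chosen uniformly at random.
    good = rng.integers(n_contexts, size=n_workers)
    if model == "spammer_hammer":
        acc = np.full((n_workers, n_contexts), 0.5)  # spammer: random binary label
        acc[rows, good] = 1.0                        # hammer: always correct
    elif model == "one_coin":
        acc = np.full((n_workers, n_contexts), 0.6)
        acc[rows, good] = 0.9
    elif model == "one_coin_malicious":
        acc = np.full((n_workers, n_contexts), 0.6)
        # Pick a second context, different from the good one (needs n_contexts >= 2).
        offset = 1 + rng.integers(n_contexts - 1, size=n_workers)
        bad = (good + offset) % n_contexts
        acc[rows, bad] = 0.3
        acc[rows, good] = 0.9
    else:
        raise ValueError(f"unknown worker model: {model}")
    return acc

def simulate_labels(true_labels, contexts, acc, rng):
    """Return an (n_tasks, n_workers) matrix of +1/-1 labels."""
    p_correct = acc[:, contexts].T                   # shape (n_tasks, n_workers)
    correct = rng.random(p_correct.shape) < p_correct
    truth = true_labels[:, None]
    return np.where(correct, truth, -truth)

def true_accuracy(labels, true_labels):
    """Fraction of tasks each worker labels correctly."""
    return (labels == true_labels[:, None]).mean(axis=0)

rng = np.random.default_rng(0)
contexts = rng.integers(3, size=351)                 # stand-in for k-means cluster ids
true_labels = rng.choice([-1, 1], size=351)          # stand-in for dataset labels
acc = accuracy_table("one_coin_malicious", 30, 3, rng)
labels = simulate_labels(true_labels, contexts, acc, rng)
# Histogram over 5% intervals, in the spirit of the worker-accuracy distribution plots.
counts, edges = np.histogram(true_accuracy(labels, true_labels),
                             bins=np.linspace(0.0, 1.0, 21))
```

Here the contexts and true labels are random stand-ins; in the experiments they come from k-means clustering and the dataset's ground truth.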


With the generated labels from simulated workers, we can calculate the true accuracy
for each worker by checking the consistency with the true labels. Figure 1 illustrates the
counts of simulated workers with the true accuracy falling in the associated interval (e.g.,
0.65 represents that the true accuracy is between 60% and 65%). It is easy to see that
the spammer-hammer model and the one-coin model (malicious) create more adversarial
environments than the one-coin model.

2 http://archive.ics.uci.edu/ml/

(a) N = 351, K = 30, S = 3    (b) N = 569, K = 40, S = 4    (c) N = 768, K = 50, S = 5
(d) N = 351, K = 30, S = 3    (e) N = 569, K = 40, S = 4    (f) N = 768, K = 50, S = 5
(g) N = 351, K = 30, S = 3    (h) N = 569, K = 40, S = 4    (i) N = 768, K = 50, S = 5

Figure 2: Comparison results on three benchmark datasets with three worker models.

We compare BBTA with three state-of-the-art task assignment methods: IEThresh,
CrowdSense and OptKG in terms of accuracy. Accuracy is calculated as the proportion
of correct estimates for true labels. Also, since the budget is limited, we expect a task
assignment method to achieve its highest accuracy as fast as possible as the budget
increases (high convergence speed). We set N' = 0 and 1 for BBTA to see the effectiveness
of the pure exploration phase. We also implement a naive baseline for comparison: we
randomly select a task-worker pair at each step, and use the majority voting mechanism for
label aggregation. The accuracy of all methods is compared at different levels of budget. For
BBTA, OptKG, and the naive baseline, we set the maximum amount of budget at T =
15N. Since the budget is not pre-fixed in IEThresh and CrowdSense, we carefully select
the threshold parameters for them, which affect the consumed budgets. Additionally, we
also try to introduce the state-of-the-art methods (designed for the homogeneous setting)
into the heterogeneous setting. Specifically, we split the total budget and allocate a
sub-budget to each context in proportion to the number of tasks with this context. In particular,
for context s, we allocate the sub-budget T · |tasks with context s| / N. Then we can run
an instance of a homogeneous method for each context within the associated sub-budget.
Since OptKG has the most similar problem setting to that of the proposed method, it is
straightforward to run multiple instances of OptKG with a pre-fixed sub-budget for each
context. However, for the two heuristic methods IEThresh and CrowdSense, it is difficult
to figure out how to use them in this way, since the budget cannot be pre-determined
in their settings.

Figure 3: Results of changing N' on three benchmark datasets with three worker models.
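
As a minimal sketch of the proportional split described above, the following Python function allocates T · |tasks with context s| / N labels to each context. The paper does not say how fractional shares are rounded, so the largest-remainder rounding used here is an assumption, and `split_budget` is a hypothetical name.

```python
from collections import Counter

def split_budget(total_budget, contexts):
    """Split a label budget across contexts in proportion to their task counts.

    `contexts` holds one context id per task. Shares are floored, and leftover
    labels go to the contexts with the largest remainders (an assumption).
    """
    counts = Counter(contexts)
    n_tasks = len(contexts)
    shares = {s: total_budget * c // n_tasks for s, c in counts.items()}
    leftover = total_budget - sum(shares.values())
    by_remainder = sorted(counts,
                          key=lambda s: (total_budget * counts[s]) % n_tasks,
                          reverse=True)
    for s in by_remainder[:leftover]:
        shares[s] += 1
    return shares

# Two equally sized contexts with a budget of 15 labels per task.
contexts = ["sports"] * 102 + ["makeup-cooking"] * 102
print(split_budget(15 * len(contexts), contexts))
# -> {'sports': 1530, 'makeup-cooking': 1530}
```

Each homogeneous method instance would then be run on one context with its own share as the pre-fixed budget.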

Figure 2 shows the averages and standard errors of accuracy as functions of budgets
for all methods in nine cases (i.e. three datasets with three worker models). As we
can see, BBTA with N' = 1 works better than that with N' = 0, indicating that the
pure exploration phase helps in improving the performance. It is also shown that BBTA
(N' = 1) outperforms other methods in all six cases with the spammer-hammer model and
the one-coin model (malicious). This demonstrates that BBTA can handle spamming or
malicious labels better than others in more adversarial heterogeneous environments. For
the three cases with the one-coin model where there are more reliable labels, almost all
methods perform well. Nevertheless, IEThresh performs poorly in all cases,
even worse than the naive baseline. The reason is that IEThresh samples a subset of
reliable workers at each step, by calculating upper confidence intervals of workers based
on their labeling performances on previous tasks. In heterogeneous settings, however,
a worker reliable at previous tasks may be poor at the next ones. This makes IEThresh
learn workers' reliability incorrectly, resulting in poor sampling of worker subsets.
Although CrowdSense also adopts the mechanism of dynamically sampling worker subsets,
its exploration-exploitation criterion gives it a chance of randomly selecting some workers
who may be reliable at the next tasks. For OptKG, not surprisingly, OptKG (Multi.),
which is aware of contexts, outperforms the original OptKG. The tendency of both of
them implies that they may achieve the best accuracy as the budget goes to infinity, but
the convergence speed is shown to be slower than those of BBTA and CrowdSense. In
crowdsourcing, it is important to achieve the best accuracy as fast as possible, especially
when the budget is limited.
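
For reference, the naive baseline used in this comparison (a random task-worker pair at each step, aggregated by majority voting) might look like the sketch below. It assumes binary +1/-1 labels stored in a complete label matrix, sampling pairs with replacement, and random tie-breaking; none of these details are specified in the text, and `naive_baseline` is a hypothetical name.

```python
import numpy as np

def naive_baseline(label_matrix, budget, rng):
    """Spend `budget` labels on random task-worker pairs, then majority-vote."""
    n_tasks, n_workers = label_matrix.shape
    votes = np.zeros(n_tasks)
    for _ in range(budget):
        i = rng.integers(n_tasks)       # random task
        j = rng.integers(n_workers)     # random worker
        votes[i] += label_matrix[i, j]  # +1/-1 labels sum to the vote margin
    estimates = np.sign(votes)
    ties = estimates == 0               # unlabeled tasks or exact ties
    estimates[ties] = rng.choice([-1, 1], size=int(ties.sum()))
    return estimates

def accuracy(estimates, true_labels):
    """Proportion of correct estimates for true labels."""
    return float((estimates == true_labels).mean())
```

Calling `naive_baseline` with increasing budgets and averaging `accuracy` over repeated runs gives an accuracy-versus-budget curve of the kind compared in the experiments.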

We then run BBTA with varying N', to see how N' affects the performance. We
set N' = 0, 1, 5, 10, and the results are shown in Figure 3. It can be seen that without
the pure exploration phase (i.e. N' = 0), the performance is the worst in all nine cases.
On the other hand, when we add the pure exploration phase (N' > 0), the performance
is improved. However, we are unable to conclude that the larger N' is, the better the
performance is (e.g. N' = 10 does not always make the method achieve its highest
accuracy fastest in all nine cases). Indeed, a larger N' means a longer pure exploration
phase, which consumes a larger proportion of the total budget. For example, when
N' = 10, the performance usually starts from a lower accuracy level than when
we choose other exploration lengths. Although its start level is lower, as the budget
increases, the performance when N' = 10 can outperform all the others in most cases of
the spammer-hammer and one-coin (malicious) models. However, it can only achieve the
same level as the performance when N' = 5 in all cases of the one-coin model, but with
a lower convergence speed. In BBTA, we can only choose N' to affect the performance,
and there are also some other factors, such as the true reliability of workers and how
different their labeling performances are from each other, of which we usually do not
have prior knowledge in real-world crowdsourcing. If we could somehow know beforehand
that the worker pool is complex (in terms of the difference of workers' reliability) as in
the spammer-hammer and one-coin (malicious) models, setting a larger N' may help;
otherwise a practical choice would be to set a small N'.

5.2 Real Data 

Next, we compare BBTA with the existing task assignment methods on two real-world 
datasets. 

(a) N = 800, K = 164, S = 1    (b) N = 204, K = 42, S = 2

Figure 4: Distribution of true accuracy of workers for two real-world datasets.


5.2.1 Recognizing Textual Entailment 


We first use a real dataset from recognizing textual entailment (RTE) tasks in natural
language processing. This dataset was collected by Snow et al. (2008) using Amazon
Mechanical Turk (MTurk). For each RTE task in this dataset, the worker is presented with
two sentences and given a binary choice of whether the second sentence can be inferred
from the first one. The true labels of all tasks are available and used for evaluating the
performances of all task assignment methods in our experiments.

In this dataset, there is no context information available, or equivalently we can consider
all tasks to have the same context. That is, this is a homogeneous dataset (S = 1). The numbers
of tasks and workers in this dataset are N = 800 and K = 164 respectively. Since
the originally collected label set is not complete (i.e. not every worker gives a label for
each task), we use a matrix completion algorithm^3 to fill the incomplete label
matrix, to make sure that we can collect a label when any task is assigned to any worker
in the experiments. Then we calculate the true accuracy of workers for this dataset, as
illustrated in Figure 4(a).

Figure 5(a) depicts the comparison results on the RTE data, showing that all methods
work very well. The reason is that there is a significant proportion of reliable workers in
this dataset, as we can see in Figure 4(a), and finding them is not a difficult mission
for any of the methods. It is also shown in Figure 5(a) that BBTA with N' = 1 converges to the
highest accuracy slightly faster than others. This is important in practice, especially in
the budget-sensitive setting, because achieving higher accuracy within a lower budget is
always favored from the perspective of requesters in crowdsourcing.


5.2.2 Gender Hobby Dataset 


The second real dataset we use is Gender Hobby (GH), collected from MTurk by Mo et al.
(2013). Tasks in this dataset are binary questions that are explicitly divided into two
contexts (S = 2): sports and makeup-cooking. This is a typical heterogeneous dataset,
where there are N = 204 tasks (102 per context) and K = 42 workers. Since the
label matrix in the original GH data is also incomplete, we use the matrix completion
algorithm again to fill the missing entries. Figure 4(b) illustrates the distribution of the
true accuracy of workers in this dataset. It is easy to see that the labels given by the
workers in this dataset are more malicious than those in the RTE data (Figure 4(a)) due
to the increased diversity of tasks.

3 We use GROUSE (Balzano et al., 2010) for label matrix completion in our experiments.

Figure 5: Comparison results on two real-world datasets.

Figure 5(b) plots the experimental results, showing that BBTA with N' = 0 and
N' = 1 outperforms the other methods on this typical heterogeneous dataset.


6 Conclusions and Future Work 

In this paper, we proposed a contextual bandit formulation to address the problem of task 
assignment in heterogeneous crowdsourcing. In the proposed method called the bandit- 
based task assignment (BBTA), we first explored workers’ reliability and then attempted 
to adaptively assign tasks to appropriate workers. 

We used the exponential weighting scheme to handle the exploration-exploitation 
trade-off in worker selection and utilized the weighted voting mechanism to aggregate the 
collected labels. Thanks to the contextual formulation, BBTA models the task-dependent 
reliability for workers, and thus is able to intelligently match workers to tasks they can do 
best. This is a significant advantage over the state-of-the-art task assignment methods. 
We experimentally showed the usability of BBTA in heterogeneous crowdsourcing tasks. 
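
To make the exponential weighting idea concrete, the following is a generic Exp3-style sketch with one set of cumulative loss estimates per context. It is not a reproduction of BBTA: the fixed learning rate `eta`, the losses in [0, 1], and the class name `ExpWeights` are all assumptions made for illustration.

```python
import numpy as np

class ExpWeights:
    """Generic exponential-weights worker selector, one weight vector per context."""

    def __init__(self, n_workers, n_contexts, eta):
        self.eta = eta  # assumed fixed learning rate
        # Importance-weighted cumulative loss estimates per (context, worker).
        self.cum_loss = np.zeros((n_contexts, n_workers))

    def probabilities(self, s):
        """Selection distribution over workers for context s."""
        z = -self.eta * self.cum_loss[s]
        z -= z.max()  # shift for numerical stability
        w = np.exp(z)
        return w / w.sum()

    def select(self, s, rng):
        """Sample a worker for a task with context s; return (worker, probability)."""
        p = self.probabilities(s)
        j = int(rng.choice(len(p), p=p))
        return j, p[j]

    def update(self, s, j, p_j, loss):
        """Add the unbiased loss estimate loss / p_j for the selected worker only."""
        self.cum_loss[s, j] += loss / p_j
```

Workers with small estimated cumulative loss on a context are sampled more often for that context, while every worker keeps a nonzero probability of being explored.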

We also theoretically investigated the regret bounds for BBTA. In the regret analysis, 
we showed the performance of our strategy converges to that of the optimal one as the 
budget goes to infinity. 

Heterogeneity is practical and important in recent real-world crowdsourcing systems.
There is still a lot of room for further work in the heterogeneous setting. In particular, we
consider four possible directions:
• In practical crowdsourcing, categorical labeling tasks with more than two classes
are also common besides binary ones. We can then extend the current method to
the categorical setting, where the weighted voting mechanism becomes

$$\hat{y}_i = \arg\max_{c \in C} \sum_{j=1}^{K} w_j^{s} \, \mathbb{1}_{y_{i,j} = c},$$

where C is the categorical label space.

• The similarities among tasks with different types (corresponding to the information 
sharing among different contexts) can also be considered, since real-world workers 
may have similar behaviors on different types of tasks. 

• To further improve the reliability of the crowdsourcing system, we can involve the 
supervision of domain experts, and adaptively collect labels from both workers and 
experts. In such a scenario, we need to cope with the balance between the system 
reliability and the total budget, since expert labels (considered as the ground truth) 
are much more expensive than worker labels. 

• From a theoretical point of view, the context and loss could be considered to be 
dependent on the history of actions, although reasonably modeling this dependence 
relation is challenging in the setting of heterogeneous crowdsourcing. The current 
method considered the sequence of contexts and losses as external feedback, and 
thus adopted a standard bandit formulation. If we could somehow appropriately 
capture the dependence relation mentioned above, it would be possible to further 
improve the current theoretical results. 
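
As an illustration of the categorical extension mentioned in the first direction above, weighted voting over a label space C can be sketched as follows. This is a hypothetical helper, not part of the proposed method: it assumes one weight per worker for the task's context, `None` for workers who gave no label, and ties broken by the order of `classes`.

```python
def weighted_vote(labels, weights, classes):
    """Return the class with the largest total weight among the given labels."""
    scores = {c: 0.0 for c in classes}
    for y, w in zip(labels, weights):
        if y is not None:  # skip workers who did not label this task
            scores[y] += w
    return max(classes, key=lambda c: scores[c])

labels = ["cat", "dog", "dog", None]
weights = [2.0, 0.5, 0.7, 3.0]
print(weighted_vote(labels, weights, ["cat", "dog", "bird"]))  # -> cat
```

Although two workers answer "dog", the single "cat" vote wins because its weight (2.0) exceeds their combined weight (1.2).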

In our future work, we will further investigate the challenging problems in heterogeneous
crowdsourcing.

Appendix

A Proof of Lemma 12 


This proof follows the framework of regret analysis for adversarial bandits by Bubeck
(2010).


Recall that the unbiased estimator of loss is 
where the first and second inequalities are due to $\ln x \le x - 1$ for $x > 0$ and
$\exp(x) \le 1 + x + x^2/2$ for $x \le 0$ respectively, and the third inequality is because
where the first and third inequalities are due to $\ln x \le x - 1$ for $x > 0$ and
$\exp(x) \le 1 + x + x^2/2$ for $x \le 0$ respectively, the fifth line of (10) is obtained by the
definition of the cumulative loss $L_j^0$ in the pure exploration phase, the fourth inequality of
(10) is an immediate result of using Jensen's inequality and $|X_1| = N'$, and the last line of
(10) is because we use majority voting in the pure exploration phase. We also have
Since it is shown by Bubeck (2010) that $\Phi'(\eta) \ge 0$ and we set $\eta_t = \ldots$, we have
where the second inequality is because $T_2 = T - KN'$ and


(13) shows the bound for our strategy competing against any strategy that always
selects one single worker, thus concluding the proof.
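
The two elementary inequalities invoked in the proof, ln x ≤ x − 1 for x > 0 and exp(x) ≤ 1 + x + x²/2 for x ≤ 0, can be sanity-checked numerically. The snippet below is only a quick grid check under an assumed floating-point tolerance, not a proof, and the function names are hypothetical.

```python
import numpy as np

TOL = 1e-12  # assumed slack for floating-point rounding

def check_log_bound(xs):
    """True if ln(x) <= x - 1 holds (up to TOL) for every x in xs."""
    return bool(np.all(np.log(xs) <= xs - 1 + TOL))

def check_exp_bound(xs):
    """True if exp(x) <= 1 + x + x^2 / 2 holds (up to TOL) for every x in xs."""
    return bool(np.all(np.exp(xs) <= 1 + xs + xs**2 / 2 + TOL))

print(check_log_bound(np.linspace(1e-6, 50.0, 100001)))   # positive x
print(check_exp_bound(np.linspace(-50.0, 0.0, 100001)))   # non-positive x
print(check_exp_bound(np.array([1.0])))                   # positive x: bound fails
```

The last call shows why the exponential bound is applied only to non-positive arguments: for x = 1, exp(1) ≈ 2.718 exceeds 1 + 1 + 1/2 = 2.5.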